Spurious Solutions to the Bellman Equation

Authors

  • Mance E. Harmon
  • Leemon C. Baird
Abstract

Reinforcement learning algorithms often work by finding functions that satisfy the Bellman equation. This yields an optimal solution for prediction with Markov chains and for controlling a Markov decision process (MDP) with a finite number of states and actions. The same approach is also frequently applied to Markov chains and MDPs with infinitely many states. We show that, in this case, the Bellman equation may have multiple solutions, many of which lead to erroneous predictions and policies (Baird, 1996). Algorithms and conditions are presented that guarantee a single, optimal solution to the Bellman equation.

1 REINFORCEMENT LEARNING AND DYNAMIC PROGRAMMING

1.1 THE BELLMAN EQUATION

Reinforcement learning algorithms often work by using some form of dynamic programming to find functions that satisfy the Bellman equation. For example, in a pure prediction problem, the true, optimal value of a state, $V^*(x_t)$, is defined by equation (1), where $\langle \cdot \rangle$ denotes the expected value taken over all possible sequences of states after time $t$, $\gamma$ is a discount factor strictly between zero and one, and $R$ is the reinforcement received on each time step.

$$V^*(x_t) = \left\langle R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots \right\rangle \tag{1}$$

It is clear from equation (1) that there is a simple relationship between successive states. This relationship is given in equation (2) and is referred to as the Bellman equation for this problem.

$$V^*(x_t) = \left\langle R_t + \gamma V^*(x_{t+1}) \right\rangle \tag{2}$$

Bellman equations can be derived similarly for other algorithms such as Q-learning (Watkins, 1989) or advantage learning (Baird, 1993; Harmon and Baird, 1996).

1.2 UNIQUE SOLUTIONS

A learning system will maintain an approximation $V$ to the true answer $V^*$, and the difference between the two can be called the error $e$, defined in equation (3). Equation (4) shows why dynamic programming works: if the learning system can find a function $V$ that satisfies equation (2) for all states, then equation (4) will also hold for all states.

$$V(x_t) = V^*(x_t) + e(x_t) \tag{3}$$

$$V^*(x_t) + e(x_t) = \left\langle R_t + \gamma \bigl( V^*(x_{t+1}) + e(x_{t+1}) \bigr) \right\rangle$$
$$V^*(x_t) + e(x_t) = \left\langle R_t + \gamma V^*(x_{t+1}) \right\rangle + \gamma \left\langle e(x_{t+1}) \right\rangle$$
$$e(x_t) = \gamma \left\langle e(x_{t+1}) \right\rangle \tag{4}$$

Suppose there are a finite number of states, and call the state with the largest error $x_t$. The discount factor $\gamma$ is a positive number less than 1, so equation (4) says that the largest error equals only a fraction of a weighted average of all the errors. The only way this can happen is if all the errors are zero. Thus, for a finite number of states, the Bellman equation has a unique solution, and that solution is optimal. On the basis of this result, reinforcement learning systems have been created that simply try to find a function $V$ that satisfies the Bellman equation (e.g., Tesauro, 1994; Crites and Barto, 1995). But will such a $V$ be optimal even when there are an infinite number of states? Can we assume that the finite-state results also apply to the infinite-state case?

2 SPURIOUS SOLUTIONS

It would be useful to determine under what conditions dynamic programming is guaranteed to find not only a value function that satisfies equation (4), but also a value function whose value error, defined in equation (5), is zero for all $x$.

$$e(x) = V^*(x) - V(x) \tag{5}$$

One solution to equation (4) is the optimal value function $V^*$. However, in many cases there may exist more than a single, unique solution to the Bellman equation (Baird, 1996). If there is a finite number of states, then a unique solution to equation (4) does exist.
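To make the finite-state argument concrete, here is a minimal Python sketch (an illustration of this reasoning, not code from the paper; the four-state chain, transition matrix P, reinforcement vector R, and γ = 0.9 are all assumed for the example). In matrix form the Bellman equation (2) reads V = R + γPV; with finitely many states and γ < 1, the matrix I − γP is invertible, so solving the linear system recovers the unique value function.

```python
import numpy as np

# Illustrative 4-state Markov chain (assumed for this sketch, not from the paper).
gamma = 0.9                           # discount factor, 0 < gamma < 1
P = np.array([[0.0, 1.0, 0.0, 0.0],   # P[i, j] = probability of moving i -> j
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 0.0, 2.0])    # expected reinforcement in each state

# Matrix form of the Bellman equation (2): V = R + gamma * P @ V.
# With finitely many states and gamma < 1, (I - gamma * P) is invertible,
# so the linear system has exactly one solution: the unique V of section 1.2.
V = np.linalg.solve(np.eye(4) - gamma * P, R)

print("V =", V)
# The Bellman residual is zero (up to floating point) at every state:
print("max Bellman residual:", np.max(np.abs(V - (R + gamma * P @ V))))
```

Here np.linalg.solve plays the role of exact dynamic programming; iterative methods such as value iteration converge to this same unique fixed point.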
If there is an infinite number of states, then there may exist an infinite number of solutions to the Bellman equation, including some with a suboptimal value function or policy.

2.1 THE INFINITE-HALL PROBLEM

Consider the simple case of a Markov chain with countably infinite states, named 0, 1, 2, ..., and with a reinforcement of zero on every transition (Figure 1).

[Figure 1: Infinite Markov chain. Each state transitions to the next (0 → 1 → 2 → ...), with a reinforcement of 0 on every transition.]

On each time step, the state number is increased by one. The Bellman equation, error relationship, and general solution for this Markov chain are given in equations (6), (7), and (8) respectively.

$$V(x_t) = \gamma V(x_{t+1}) \tag{6}$$

$$e(x_t) = \gamma e(x_{t+1}) \tag{7}$$

$$V(x) = c\,\gamma^{-x}, \quad \text{for any constant } c \tag{8}$$
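To see numerically that equation (8) describes a whole family of spurious solutions, the following sketch (again my own illustration, not the paper's code) evaluates the Bellman residual of V(x) = cγ^(−x) for several constants c over a truncated range of states. Every choice of c satisfies equation (6) exactly, yet only c = 0 gives the true value function V*(x) = 0 for this zero-reinforcement chain.

```python
import numpy as np

gamma = 0.9
states = np.arange(0, 20, dtype=float)   # truncated view of the infinite chain

def candidate_v(x, c):
    """Candidate value function V(x) = c * gamma**(-x).
    The true values for this zero-reinforcement chain are V*(x) = 0,
    so any c != 0 is a spurious solution."""
    return c * gamma ** (-x)

for c in [0.0, 1.0, -3.0]:
    v_now = candidate_v(states, c)        # V(x_t)
    v_next = candidate_v(states + 1, c)   # V(x_{t+1}); the next state is x + 1
    residual = v_now - gamma * v_next     # equation (6) says this should be 0
    print(f"c = {c:5.1f}   max |V(x) - gamma V(x+1)| = {np.abs(residual).max():.2e}")
```

Note that for c ≠ 0 the value error e(x) = −cγ^(−x) also satisfies equation (7) at every state while growing without bound as x increases, which is exactly the spurious-solution phenomenon described here.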

Similar articles

A nonlinear second order field equation – similarity solutions and relation to a Bellmann-type equation – Applications to Maxwellian Molecules

In this paper, Lie's formalism is applied to deduce classes of solutions of a nonlinear partial differential equation (nPDE) of second order with quadratic nonlinearity. The equation has the meaning of a field equation appearing in the formulation of kinetic models. Similarity solutions and transformations are given in a most general form, derived for the first time in terms of reciprocal Jacobian...


Positivity-preserving nonstandard finite difference Schemes for simulation of advection-diffusion reaction equations

Systems in which reaction terms are coupled to diffusion and advection transports arise in a wide range of chemical engineering applications, physics, biology, and environmental science. In these cases, the components of the unknown can denote concentrations or population sizes, quantities that need to remain positive. Classical finite difference schemes may produce numerical drawback...


Evolutionary Programming as a Solution Technique for the Bellman Equation

Evolutionary programming is a stochastic optimization procedure which has proved useful in optimizing difficult functions. It is shown that evolutionary programming can be used to solve the Bellman equation problem with a high degree of accuracy and substantially less CPU time than Bellman equation iteration. Future applications will focus on sometimes binding constraints – a class of problem fo...
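As a hedged sketch of how such an evolutionary-programming approach might look (my illustration under assumed details, not the implementation from this abstract): candidate value functions are individuals, the squared Bellman residual is the fitness to minimize, and the population evolves by Gaussian mutation with truncation selection, using the same illustrative four-state chain as earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative 4-state chain as in the earlier sketch (assumed example).
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 0.0, 2.0])

def bellman_residual(v):
    """Fitness to minimize: squared error in the Bellman equation V = R + gamma*P@V."""
    return np.sum((v - (R + gamma * P @ v)) ** 2)

# Evolutionary-programming-style loop: Gaussian mutation of each parent,
# then truncation selection over parents plus offspring.
pop = rng.normal(0.0, 5.0, size=(50, 4))      # 50 candidate value functions
for gen in range(500):
    sigma = 0.5 * 0.99 ** gen                 # slowly annealed mutation scale
    offspring = pop + rng.normal(0.0, sigma, size=pop.shape)
    combined = np.vstack([pop, offspring])
    fitness = np.array([bellman_residual(v) for v in combined])
    pop = combined[np.argsort(fitness)[:50]]  # keep the 50 fittest

best = pop[0]
exact = np.linalg.solve(np.eye(4) - gamma * P, R)
print("EP estimate:   ", np.round(best, 3))
print("exact solution:", np.round(exact, 3))
```

Because the finite chain has a unique Bellman solution, driving the residual to zero necessarily recovers the exact value function; the stochastic search merely replaces the linear solve.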


A family of positive nonstandard numerical methods with application to Black-Scholes equation

Nonstandard finite difference schemes for the Black-Scholes partial differential equation preserving the positivity property are proposed. Computationally simple schemes are derived by using a nonlocal approximation in the reaction term of the Black-Scholes equation. Unlike the standard methods, the solutions of the new proposed schemes are positive and free of spurious oscillations.


Monotone concave operators: An application to the existence and uniqueness of solutions to the Bellman equation

We propose a new approach to the issue of existence and uniqueness of solutions to the Bellman equation, exploiting an emerging class of methods, called monotone map methods, pioneered in the work of Krasnosel’skii (1964) and Krasnosel’skii-Zabreiko (1984). The approach is technically simple and intuitive. It is derived from geometric ideas related to the study of fixed points for monotone conc...



Journal title:

Volume   Issue

Pages  -

Publication date: 1999